fix(pb, gas): fix overreporting slots bug, metrics wf not stopping bug, corrupt history preventing wf pull by MasterPtato · Pull Request #4148 · rivet-dev/rivet

MasterPtato · 2026-02-06T21:50:09Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

railway-app · 2026-02-06T21:50:18Z

🚅 Deployed to the rivet-pr-4148 environment in rivet-frontend

Service	Status	Web	Updated (UTC)
website	❌ Build Failed (View Logs)	Web	Feb 6, 2026 at 9:52 pm
mcp-hub	✅ Success (View Logs)	Web	Feb 6, 2026 at 9:51 pm
frontend-inspector	❌ Build Failed (View Logs)	Web	Feb 6, 2026 at 9:51 pm
frontend-cloud	❌ Build Failed (View Logs)	Web	Feb 6, 2026 at 9:50 pm

MasterPtato · 2026-02-06T21:50:32Z

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-02-06T21:52:02Z

PR Review: Fix overreporting slots bug, metrics workflow not stopping bug, corrupt history preventing workflow pull

Summary

This PR addresses three critical bugs in the Pegboard and Gasoline packages:

Overreporting slots bug - Fixed serverless slot accounting
Metrics workflow not stopping - Added destroy signal when actor is destroyed early
Corrupt history preventing workflow pull - Changed error handling to gracefully skip corrupted workflows instead of failing

Code Quality & Best Practices

✅ Strengths

Error handling improvement: Converting ensure! to graceful error logging in gasoline/src/db/kv/mod.rs is a pragmatic approach that prevents workflow system deadlock due to corrupt history
Clear comments: The new comment in runtime.rs:1115-1117 clearly explains the reasoning behind conditional slot clearing
Structured logging: Using tracing::error!(?wf, ...) follows the structured logging pattern from CLAUDE.md
Improved error messages: More specific error messages ("indexed name", "indexed chunk") help with debugging

⚠️ Concerns & Suggestions

1. Slot Accounting Logic (runtime.rs:1135-1138)

// Only mark allocated_serverless_slot as false if it was allocated before and cleared now
if allocated_serverless_slot && cleared {
    state.allocated_serverless_slot = false;
}

Issue: The previous logic unconditionally set allocated_serverless_slot = false, which could have been causing the slot overreporting. This fix is correct, but the logic is subtle.

Consideration: Should there be logging when we skip clearing the slot? This would help diagnose future slot accounting issues:

if allocated_serverless_slot && cleared {
    state.allocated_serverless_slot = false;
} else if allocated_serverless_slot && !cleared {
    tracing::debug!(actor_id=?input.actor_id, "pending allocation already cleared, keeping slot allocation");
}
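One way to make this conditional easier to test is to pull it into a small pure function. The following is only a sketch: the `State` struct and the `release_slot_if_cleared` helper are hypothetical names for illustration, and it assumes `cleared` means the pending allocation was actually removed by this call. The real workflow state has more fields and may read the flag differently.

```rust
/// Hypothetical stand-in for the actor workflow state; only the flag
/// discussed in this review is modeled.
#[derive(Debug, Default, PartialEq)]
struct State {
    allocated_serverless_slot: bool,
}

/// Applies the conditional from the snippet above.
/// `cleared` is assumed to mean that clearing the pending allocation
/// actually removed an entry. Returns true only when this call released
/// the slot flag.
fn release_slot_if_cleared(state: &mut State, cleared: bool) -> bool {
    if state.allocated_serverless_slot && cleared {
        state.allocated_serverless_slot = false;
        true
    } else {
        // Either no slot was allocated, or nothing was cleared:
        // leave the flag untouched so the slot is not double-counted.
        false
    }
}
```

With the logic isolated like this, the four combinations of "slot allocated" and "cleared" can each be covered by a one-line unit test.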

2. Silent Data Corruption Handling (kv/mod.rs:1635-1640)

if current_event.indexed_names.len() != key.index {
    tracing::error!(?wf, "corrupt history, indexed name doesn't exist yet or is out of order");
    return Ok(None);
}

Issue: Returning Ok(None) prevents system failure, but it abandons the entire workflow pull attempt with only a log line as evidence. Affected workflows could get stuck without anyone noticing.

Recommendations:

Add metrics/alerting for corrupt history detection
Consider including ?current_event and key_index=?key.index in the error log for better debugging
Document the recovery path - does returning None here cause a retry, or is the workflow permanently blocked?

Example improvement:

if current_event.indexed_names.len() != key.index {
    tracing::error!(
        ?wf,
        expected_index = key.index,
        actual_len = current_event.indexed_names.len(),
        ?current_event,
        "corrupt history: indexed name doesn't exist yet or is out of order"
    );
    // TODO: Add metric for corrupt history detection
    return Ok(None);
}
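The TODO about a metric could be approached roughly as follows. This is a minimal sketch and not the project's metrics API: the `CORRUPT_HISTORY_TOTAL` counter is a plain atomic standing in for a real metrics counter, and `check_indexed_name` and `corrupt_history_count` are hypothetical helpers that model only the length-versus-index comparison.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for a real metrics counter (hypothetical; a production
/// version would use whatever metrics library the project already has).
static CORRUPT_HISTORY_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Returns Some(()) when the next indexed name arrives in order.
/// Returns None, after bumping the counter and logging, when the
/// history looks corrupt.
fn check_indexed_name(current_len: usize, key_index: usize) -> Option<()> {
    if current_len != key_index {
        CORRUPT_HISTORY_TOTAL.fetch_add(1, Ordering::Relaxed);
        eprintln!(
            "corrupt history: expected index {key_index}, have {current_len} indexed names"
        );
        return None;
    }
    Some(())
}

/// Reads the counter so an alert or a test can observe skipped workflows.
fn corrupt_history_count() -> u64 {
    CORRUPT_HISTORY_TOTAL.load(Ordering::Relaxed)
}
```

Counting at the point of detection keeps the skip graceful while still giving operators a signal to alert on.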

3. Metrics Workflow Signal (mod.rs:329-335)

runtime::SpawnActorOutput::Destroy => {
    ctx.v(2)
        .signal(metrics::Destroy { ts: util::timestamp::now() })
        .to_workflow_id(metrics_workflow_id)
        .send()
        .await?;

Question: Is there a risk that metrics_workflow_id might not exist or be invalid when the actor is destroyed early? Should there be error handling for the signal send?
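If a failed send should not abort the actor workflow, one option is to treat the signal as best effort. The sketch below is synchronous and uses a hypothetical `SignalError` type and `send_best_effort` helper; the real call is async and returns the workflow engine's own error type, so this only illustrates the shape of the handling.

```rust
/// Hypothetical error type for a failed signal send.
#[derive(Debug)]
enum SignalError {
    WorkflowNotFound,
}

/// Runs `send` and downgrades a failure to a logged warning so the
/// caller can continue tearing down the actor.
/// Returns whether the signal was delivered.
fn send_best_effort<F>(send: F) -> bool
where
    F: FnOnce() -> Result<(), SignalError>,
{
    match send() {
        Ok(()) => true,
        Err(err) => {
            eprintln!("failed to signal metrics workflow, continuing: {err:?}");
            false
        }
    }
}
```

Whether swallowing the error is acceptable depends on which outcome is worse: a metrics workflow that never stops, or an actor workflow that fails during destroy.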

4. Test Runner Environment Variables (test-runner/src/index.ts)
The changes to environment variable handling are improvements, but:

parseInt(process.env.RIVET_RUNNER_TOTAL_SLOTS ?? "1") is good, but consider using a more realistic default than 1 slot
String comparison == "1" works but is fragile. Consider process.env.AUTOSTART_SERVER !== "0" for boolean flags; note that this makes the flag default to enabled when the variable is unset.
Route changes from /start → /api/rivet/start look like an API path namespacing change. Ensure clients are updated accordingly.

5. Log Level Change (runtime.rs:519)

-tracing::warn!(actor_id=?input.actor_id, "failed to allocate (no availability), waiting for allocation");
+tracing::debug!(actor_id=?input.actor_id, "failed to allocate (no availability), waiting for allocation");

Question: Is this expected to happen frequently during normal operation? If so, debug is appropriate. If it indicates a capacity problem, warn might be better for alerting.

Potential Bugs

Workflow Recovery Path Unclear: When corrupt history is detected and Ok(None) is returned, what happens to the workflow? Is it retried? Does it need manual intervention?
Missing Error Handling: The signal send in mod.rs:334 uses ? to propagate errors, which will fail the entire actor workflow if the metrics workflow doesn't exist or can't receive signals.

Performance Considerations

No significant performance concerns. The changes are mostly bug fixes with minimal performance impact. The graceful handling of corrupt history actually improves system resilience.

Security Concerns

No security issues identified. The changes are focused on internal workflow state management.

Test Coverage

Missing Tests:

Test for slot accounting bug - verify that allocated_serverless_slot is correctly maintained when clear_pending_allocation is called with various states
Test for corrupt history handling - verify system continues operating when history corruption is detected
Test for metrics destroy signal - verify signal is sent when actor is destroyed early

Recommendation: Add unit tests for the slot accounting logic and integration tests for the corrupt history scenario.

Additional Notes

Documentation: Consider updating documentation about:
- Recovery procedures for corrupt workflow history
- Slot accounting semantics
- Metrics workflow lifecycle
Follow-up Tasks:
- Add monitoring/alerting for corrupt history detection
- Investigate root cause of history corruption
- Audit other uses of ensure! that might benefit from similar graceful handling
PR Description: The PR template wasn't filled out. Please update with:
- Description of each bug and root cause
- Testing performed to verify fixes
- Migration/rollout considerations

Verdict

Approve with minor suggestions ✅

The fixes address real production issues and the approach is sound. The main concern is the silent failure on corrupt history - ensure there's visibility into when this happens and a clear recovery path.

Review generated by Claude Code - focusing on code quality, potential bugs, and maintainability

graphite-app · 2026-02-06T22:59:51Z

Merge activity

Feb 6, 10:59 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
Feb 6, 11:00 PM UTC: CI is running for this pull request on a draft pull request (#4150) due to your merge queue CI optimization settings.
Feb 6, 11:01 PM UTC: Merged by the Graphite merge queue via draft PR: #4150.

MasterPtato requested a review from NathanFlurry February 6, 2026 21:50

railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4148 February 6, 2026 21:50 Destroyed

Commit 53641a6: fix(pb, gas): fix overreporting slots bug, metrics wf not stopping bug, corrupt history preventing wf pull

MasterPtato force-pushed the 02-06-fix_pb_gas_fix_overreporting_slots_bug_metrics_wf_not_stopping_bug_corrupt_history_preventing_wf_pull branch from 1c1aec0 to 53641a6 Compare February 6, 2026 22:55

railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4148 February 6, 2026 22:55 Destroyed

NathanFlurry approved these changes Feb 6, 2026

graphite-app bot closed this Feb 6, 2026

graphite-app bot deleted the 02-06-fix_pb_gas_fix_overreporting_slots_bug_metrics_wf_not_stopping_bug_corrupt_history_preventing_wf_pull branch February 6, 2026 23:01

claude bot mentioned this pull request Feb 6, 2026

[Graphite MQ] Draft PR GROUP:spec_6358ce (PRs 4148) #4150

Closed

MasterPtato wants to merge 1 commit into main from 02-06-fix_pb_gas_fix_overreporting_slots_bug_metrics_wf_not_stopping_bug_corrupt_history_preventing_wf_pull
